A common workflow for data validation in pointblank involves three basic components:
We always start with create_agent() and define how the
data can be reached and also provide some basic rules about how an
interrogation of how that data should eventually be carried out. While
we are giving the agent some default behavior, we can override some of
this on a step-by-step basis when declaring our validation steps. We
always end with interrogate() and that function carries out
the work of validating the data and generating the all-important
reporting.
small_tableThe package contains a few datasets. A really small one for
experimenting is called small_table:
pointblank::small_table
## # A tibble: 13 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
## 12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
## 13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
Let’s create an agent, give it the
small_table, and look at the report.
# Create the agent with `create_agent()`; the `tbl` is given to the agent
agent_1 <-
create_agent(
tbl = small_table,
tbl_name = "small_table",
label = "Workshop agent No. 1",
)
# Printing the `agent` will print the report with the default options
agent_1
Pointblank Validation PlanNo Interrogation Performed
|
|||||||||||||
| Workshop agent No. 1 tibble
small_table |
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Okay. Let’s provide a few validation functions.
agent_1 <-
agent_1 %>%
col_vals_gte(columns = d, value = 0) %>%
col_is_logical(columns = e) %>%
col_is_character(columns = c(b, f)) %>%
col_is_numeric(columns = d) %>%
col_vals_in_set(columns = f, set = c("low", "mid", "high")) %>%
rows_distinct()
agent_1
Pointblank Validation PlanNo Interrogation Performed
|
|||||||||||||
| Workshop agent No. 1 tibble
small_table |
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_vals_gte()
|
||||||||||||
| 2 | col_is_logical()
|
— | |||||||||||
| 3 | col_is_character()
|
— | |||||||||||
| 4 | col_is_character()
|
— | |||||||||||
| 5 | col_is_numeric()
|
— | |||||||||||
| 6 | col_vals_in_set()
|
||||||||||||
| 7 | rows_distinct()
|
— | — | ||||||||||
The report contains the information about the validation steps but
many of the table cells have no data. That area is the interrogation
data, and, we haven’t yet used the interrogate() function.
Let’s use it now.
agent_1 <- agent_1 %>% interrogate()
agent_1
| Pointblank Validation | |||||||||||||
| Workshop agent No. 1
tibble
small_table
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_vals_gte()
|
|
✓ |
13 |
131.00 |
00.00 |
— |
— |
— |
— | |||
| 2 | col_is_logical()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
— |
— |
— |
— | ||
| 3 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
— |
— |
— |
— | ||
| 4 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
— |
— |
— |
— | ||
| 5 | col_is_numeric()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
— |
— |
— |
— | ||
| 6 | col_vals_in_set()
|
|
✓ |
13 |
131.00 |
00.00 |
— |
— |
— |
— | |||
| 7 | rows_distinct()
|
— | — |
|
✓ |
13 |
110.85 |
20.15 |
— |
— |
— |
||
| 2022-10-13 00:03:53 EDT < 1 s 2022-10-13 00:03:53 EDT | |||||||||||||
Now, we see a report we can use! Let’s go over each of the columns and understand what they mean.
STEP: the name of the validation function used for
the validation step and the step number.
COLUMNS: the names of the target columns used in the
validation step (if applicable).
VALUES: the values used in the validation step,
where applicable; this could be as literal values, as column names, an
expression, etc.
TBL: indicates whether any there were any changes to
the target table just prior to interrogation. A rightward arrow from a
small circle indicates that there was no mutation of the table. An arrow
from a circle to a purple square indicates that ‘preconditions’ were
used to modify the target table. An arrow from a circle to a half-filled
circle indicates that the target table has been ‘segmented’.
EVAL: a symbol that denotes the success of
interrogation evaluation for each step. A checkmark indicates no issues
with evaluation. A warning sign indicates that a warning occurred during
evaluation. An explosion symbol indicates that evaluation failed due to
an error. Hover over the symbol for details on each condition.
UNITS: the total number of test units for the
validation step
PASS: on top is the absolute number of passing test
units and below that is the fraction of passing test units over the
total number of test units.
FAIL: on top is the absolute number of failing test
units and below that is the fraction of failing test units over the
total number of test units.
W, S, N: indicators that
show whether the warn, stop, or notify states
were entered; unset states appear as dashes, states that are set with
thresholds appear as unfilled circles when not entered and filled when
thresholds are exceeded (colors for W, S, and
N are amber, red, and blue)
EXT: a column that provides buttons to download data
extracts as CSV files for row-based validation steps having failing test
units. Buttons only appear when there is data to collect.
We see nothing in the W, S, and
N columns. This is because we have to explicitly set
thresholds for those to be active. We’ll do that next…
We often should think about what’s tolerable in terms of data quality
and implement that into our reporting. Let’s set proportional failure
thresholds to the warn, stop, and
notify states using the action_levels()
function.
# Create an `action_levels` object with the namesake function.
al <-
action_levels(
warn_at = 0.15,
stop_at = 0.25,
notify_at = 0.35
)
# This can be printed for inspection
al
## -- The `action_levels` settings
## WARN failure threshold of 0.15 of all test units.
## STOP failure threshold of 0.25 of all test units.
## NOTIFY failure threshold of 0.35 of all test units.
## ----
We are using threshold fractions of test units (between
0 and 1). For 0.15, this means
that if 15% percent of the test units are found to fail (i.e.,
don’t meet the expectation), then the designated failure state is
entered.
Absolute values starting from 1 can be used instead, and
this constitutes an absolute failure threshold (e.g., 10
means that if 10 of the test units are found to fail, the
failure state is entered).
What are test units? They make up the individual tests for a
validation step. They will vary by the validation function used but, in
simple terms, a validation function that validates values in a column
will have the number of test units equal to the number of rows. A
validation function that validates a column type will have exactly one
test unit. This is always given in the UNITS column of the
reporting table.
Let’s use the action_levels object in a new validation
process (based on the same small_table dataset). We’ll make
it so the validation functions used will result in more failing test
units.
agent_2 <-
create_agent(
tbl = small_table,
tbl_name = "small_table",
label = "Workshop agent No. 2",
actions = al
) %>%
col_is_posix(columns = date_time) %>%
col_vals_lt(columns = a, value = 7) %>%
col_vals_regex(columns = b, regex = "^[0-9]-[a-w]{3}-[2-9]{3}$") %>%
col_vals_between(columns = d, left = 0, right = 4000) %>%
col_is_logical(columns = e) %>%
col_is_character(columns = c(b, f)) %>%
col_vals_lt(columns = d, value = 9600) %>%
col_vals_in_set(columns = f, set = c("low", "mid")) %>%
rows_distinct() %>%
interrogate()
agent_2
| Pointblank Validation | |||||||||||||
| Workshop agent No. 2
tibble
small_tableWARN
0.15
STOP
0.25
NOTIFY
0.35
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_is_posix()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 2 | col_vals_lt()
|
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||||
| 3 | col_vals_regex()
|
|
✓ |
13 |
60.46 |
70.54 |
● |
● |
● |
||||
| 4 | col_vals_between()
|
|
✓ |
13 |
120.92 |
10.08 |
○ |
○ |
○ |
||||
| 5 | col_is_logical()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 6 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 7 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 8 | col_vals_lt()
|
|
✓ |
13 |
120.92 |
10.08 |
○ |
○ |
○ |
||||
| 9 | col_vals_in_set()
|
|
✓ |
13 |
70.54 |
60.46 |
● |
● |
● |
||||
| 10 | rows_distinct()
|
— | — |
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||
| 2022-10-13 00:03:53 EDT < 1 s 2022-10-13 00:03:54 EDT | |||||||||||||
Some notes:
warn, stop, and
notify states are presented in the table header; these are
defaults for every validation stepW, S, and N columns); steps
2, 3, 9, and 10 have
at least the warn conditionwarn state (look at steps 4 and
8); they still provide CSVs for failed rows but the
W indicator circle isn’t filled inHow you set the default thresholds will depend on how strict the
measure for data quality is. There might be certain validation steps
where we’d like to be more stringent. For the next validation process we
will apply the action_levels() function to individual
steps, overriding the default setting.
agent_3 <-
create_agent(
tbl = small_table,
tbl_name = "small_table",
label = "Workshop agent No. 3",
actions = al
) %>%
col_is_posix(columns = date_time) %>%
col_vals_lt(columns = a, value = 7) %>%
col_vals_regex(columns = b, regex = "^[0-9]-[a-w]{3}-[2-9]{3}$") %>%
col_vals_between(
columns = d,
left = 0,
right = 4000,
actions = action_levels( # Setting `actions` at the individual
warn_at = 1, # validation step. This time, using absolute
stop_at = 3, # threshold values (i.e., a single test unit
notify_at = 5 # failing triggers the `warn` state
)
) %>%
col_is_logical(columns = e) %>%
col_is_character(columns = c(b, f)) %>%
col_vals_lt(columns = d, value = 9600) %>%
col_vals_in_set(columns = f, set = c("low", "mid")) %>%
rows_distinct() %>%
interrogate()
agent_3
| Pointblank Validation | |||||||||||||
| Workshop agent No. 3
tibble
small_tableWARN
0.15
STOP
0.25
NOTIFY
0.35
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_is_posix()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 2 | col_vals_lt()
|
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||||
| 3 | col_vals_regex()
|
|
✓ |
13 |
60.46 |
70.54 |
● |
● |
● |
||||
| 4 | col_vals_between()
|
|
✓ |
13 |
120.92 |
10.08 |
● |
○ |
○ |
||||
| 5 | col_is_logical()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 6 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 7 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 8 | col_vals_lt()
|
|
✓ |
13 |
120.92 |
10.08 |
○ |
○ |
○ |
||||
| 9 | col_vals_in_set()
|
|
✓ |
13 |
70.54 |
60.46 |
● |
● |
● |
||||
| 10 | rows_distinct()
|
— | — |
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||
| 2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT | |||||||||||||
Those CSV buttons are useful for sharing the report with others since they don’t even need to know R to get some use from those extracts. For the person familiar with R and pointblank, it is possible to get data frames for the failed rows (per validation step).
We can use the get_data_extracts() function to obtain a
list of data frames, or, use the i argument to get a data
frame available for a specific step. Not all steps will have associated
data frames. Also, not all validation functions will produce data frames
here (they need to check values in columns).
Let’s use get_data_extracts() on
agent_3.
get_data_extracts(agent = agent_3)
## $`2`
## # A tibble: 2 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 2 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
##
## $`3`
## # A tibble: 7 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 2 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 3 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 4 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 5 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
## 6 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
## 7 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
##
## $`4`
## # A tibble: 1 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
##
## $`8`
## # A tibble: 1 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
##
## $`9`
## # A tibble: 6 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 3 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 4 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 5 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 6 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
##
## $`10`
## # A tibble: 2 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 2 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
The list components are named for the validation steps that have data
extracts (i.e., filtered rows where test unit failures occurred). Let’s
get an individual data extract from step 9 (the
col_vals_in_set() step, which looked at column
f):
get_data_extracts(agent = agent_3, i = 9)
## # A tibble: 6 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 3 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 4 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 5 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 6 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
Sometimes, if your methodology allows for it, you want to use the
best part of the input data for something else. With the
get_sundered_data(), we use provide an agent object that
interrogated the data and what we get back could be:
Let’s make new agent and validate small_table again.
agent_4 <-
create_agent(
tbl = small_table,
tbl_name = "small_table",
label = "Workshop agent No. 4"
) %>%
col_vals_gt(columns = d, value = 1000) %>%
col_vals_between(
columns = c,
left = vars(a), right = vars(d), # Using values in columns, not literal vals
na_pass = TRUE
) %>%
interrogate()
agent_4
| Pointblank Validation | |||||||||||||
| Workshop agent No. 4
tibble
small_table
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_vals_gt()
|
|
✓ |
13 |
70.54 |
60.46 |
— |
— |
— |
||||
| 2 | col_vals_between()
|
|
✓ |
13 |
90.69 |
40.31 |
— |
— |
— |
||||
| 2022-10-13 00:03:55 EDT < 1 s 2022-10-13 00:03:55 EDT | |||||||||||||
Get the sundered data piece that contains only rows that passed both validation steps (this is the default piece). This yields 5 of 13 total rows.
agent_4 %>% get_sundered_data()
## # A tibble: 5 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 4 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## 5 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
Get the complementary data piece: all of those rows that failed either of the two validation steps. This yields 8 of 13 total rows.
agent_4 %>% get_sundered_data(type = "fail")
## # A tibble: 8 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 2 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 3 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 4 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 5 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 6 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 7 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
## 8 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
We can get all of the input data returned with a flag column (called
.pb_combined). This is done by using
type = "combined" and that rightmost column will contain
"pass" and "fail" values.
agent_4 %>% get_sundered_data(type = "combined")
## # A tibble: 13 × 9
## date_time date a b c d e f .pb_c…¹
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-… 3 3423. TRUE high pass
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-… 8 10000. TRUE low pass
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-… 3 2343. TRUE high fail
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-… NA 3892. FALSE mid pass
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-… 7 284. TRUE low fail
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-… 4 3291. TRUE mid pass
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-… 3 843. TRUE high fail
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-… 2 1036. FALSE low fail
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-… 9 838. FALSE high fail
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-… 9 838. FALSE high fail
## 11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-… 7 834. TRUE low fail
## 12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-… 8 108. FALSE low fail
## 13 2016-01-30 11:23:00 2016-01-30 1 3-dka-… NA 2230. TRUE high pass
## # … with abbreviated variable name ¹.pb_combined
The labels can be changed and this is flexible:
agent_4 %>% get_sundered_data(type = "combined", pass_fail = c(TRUE, FALSE))
## # A tibble: 13 × 9
## date_time date a b c d e f .pb_c…¹
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr> <lgl>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-… 3 3423. TRUE high TRUE
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-… 8 10000. TRUE low TRUE
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-… 3 2343. TRUE high FALSE
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-… NA 3892. FALSE mid TRUE
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-… 7 284. TRUE low FALSE
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-… 4 3291. TRUE mid TRUE
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-… 3 843. TRUE high FALSE
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-… 2 1036. FALSE low FALSE
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-… 9 838. FALSE high FALSE
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-… 9 838. FALSE high FALSE
## 11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-… 7 834. TRUE low FALSE
## 12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-… 8 108. FALSE low FALSE
## 13 2016-01-30 11:23:00 2016-01-30 1 3-dka-… NA 2230. TRUE high TRUE
## # … with abbreviated variable name ¹.pb_combined
agent_4 %>% get_sundered_data(type = "combined", pass_fail = 0:1)
## # A tibble: 13 × 9
## date_time date a b c d e f .pb_c…¹
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr> <int>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-… 3 3423. TRUE high 0
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-… 8 10000. TRUE low 0
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-… 3 2343. TRUE high 1
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-… NA 3892. FALSE mid 0
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-… 7 284. TRUE low 1
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-… 4 3291. TRUE mid 0
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-… 3 843. TRUE high 1
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-… 2 1036. FALSE low 1
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-… 9 838. FALSE high 1
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-… 9 838. FALSE high 1
## 11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-… 7 834. TRUE low 1
## 12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-… 8 108. FALSE low 1
## 13 2016-01-30 11:23:00 2016-01-30 1 3-dka-… NA 2230. TRUE high 0
## # … with abbreviated variable name ¹.pb_combined
get_agent_x_list()The agent’s x-list is a record of information that the agent possesses at any given time. The x-list will contain the most complete information after an interrogation has taken place (before then, the data largely reflects the validation plan).
The x-list can be constrained to a particular validation step (by
supplying the step number to the i argument), or, we can
get the information for all validation steps by leaving i
unspecified.
Let’s obtain such a list from agent_3, which had 10
validation steps:
# Generate the `x_list` object from `agent_3`
x_list_3 <- agent_3 %>% get_agent_x_list
# Printing this gives us a console preview
# of which components are available
x_list_3
## -- The x-list for table `small_table`
## ---- ALL STEPS ----
## $time_start $time_end (POSIXct [1])
## $label $tbl_name $tbl_src $tbl_src_details (chr [1])
## $tbl (spec_tbl_df tbl_df tbl data.frame)
## $col_names $col_types (chr [8])
## $i $type $columns $values $label $briefs (mixed [10])
## $eval_error $eval_warning (lgl [10])
## $capture_stack (list [10])
## $n $n_passed $n_failed $f_passed $f_failed (num [10])
## $warn $stop $notify (lgl [10])
## $validation_set (tbl_df [10, 35])
## $lang (chr [1])
## $report_object (blastula_message)
## $report_html $report_html_small (chr [1])
## ----
The amount of information contained in here is comprehensive (see
?get_agent_x_list for a detailed breakdown) but we can
provide a few examples.
The number of test units in each validation step.
x_list_3$n
## [1] 1 13 13 13 1 1 1 13 13 13
The number of passing test units in each validation step.
x_list_3$n_passed
## [1] 1 11 6 12 1 1 1 12 7 11
The fraction of passing test units in each validation step.
x_list_3$f_passed
## [1] 1.00000 0.84615 0.46154 0.92308 1.00000 1.00000 1.00000 0.92308 0.53846
## [10] 0.84615
The warn, stop, and notify
states. We can arrange that in a tibble and use the step numbers
(i) as well.
dplyr::tibble(
step = x_list_3$i,
warn = x_list_3$warn,
stop = x_list_3$stop,
notify = x_list_3$notify
)
## # A tibble: 10 × 4
## step warn stop notify
## <int> <lgl> <lgl> <lgl>
## 1 1 FALSE FALSE FALSE
## 2 2 TRUE FALSE FALSE
## 3 3 TRUE TRUE TRUE
## 4 4 TRUE FALSE FALSE
## 5 5 FALSE FALSE FALSE
## 6 6 FALSE FALSE FALSE
## 7 7 FALSE FALSE FALSE
## 8 8 FALSE FALSE FALSE
## 9 9 TRUE TRUE TRUE
## 10 10 TRUE FALSE FALSE
email_create()We can choose to email the report if the notify state is
entered. The message can be created with the agent through the
email_create() function. Here’s a useful bit of code that
allows for conditional sending.
if (any(x_list_3$notify)) {
email_create(agent_3) %>%
blastula::smtp_send(
from = "sender@email.com",
to = "recipient@email.com",
credentials = creds_file(file = "email_secret")
)
}
Such code might be useful during an automated process where data is periodically checked and failures beyond thresholds require notification.
While email_create() will generate the email message
body, functions in the blastula package are responsible
for the sending of that email. For more information on sending HTML
email, look at the help article found by using
?blastula::smtp_send.
get_agent_report()We don’t have to fully accept the defaults for a data validation
report. Using get_agent_report() gives us options.
Here’s how you can change the title:
agent_3 %>% get_agent_report(title = "The **3rd** Example")
| The 3rd Example | |||||||||||||
| Workshop agent No. 3
tibble
small_tableWARN
0.15
STOP
0.25
NOTIFY
0.35
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_is_posix()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 2 | col_vals_lt()
|
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||||
| 3 | col_vals_regex()
|
|
✓ |
13 |
60.46 |
70.54 |
● |
● |
● |
||||
| 4 | col_vals_between()
|
|
✓ |
13 |
120.92 |
10.08 |
● |
○ |
○ |
||||
| 5 | col_is_logical()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 6 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 7 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 8 | col_vals_lt()
|
|
✓ |
13 |
120.92 |
10.08 |
○ |
○ |
○ |
||||
| 9 | col_vals_in_set()
|
|
✓ |
13 |
70.54 |
60.46 |
● |
● |
● |
||||
| 10 | rows_distinct()
|
— | — |
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||
| 2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT | |||||||||||||
You can bring the steps that had serious failures up to the top:
agent_3 %>% get_agent_report(arrange_by = "severity")
| Pointblank Validation | |||||||||||||
| Workshop agent No. 3
tibble
small_tableWARN
0.15
STOP
0.25
NOTIFY
0.35
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | col_vals_regex()
|
|
✓ |
13 |
60.46 |
70.54 |
● |
● |
● |
||||
| 9 | col_vals_in_set()
|
|
✓ |
13 |
70.54 |
60.46 |
● |
● |
● |
||||
| 2 | col_vals_lt()
|
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||||
| 4 | col_vals_between()
|
|
✓ |
13 |
120.92 |
10.08 |
● |
○ |
○ |
||||
| 10 | rows_distinct()
|
— | — |
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||
| 1 | col_is_posix()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 5 | col_is_logical()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 6 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 7 | col_is_character()
|
— |
|
✓ |
1 |
11.00 |
00.00 |
○ |
○ |
○ |
— | ||
| 8 | col_vals_lt()
|
|
✓ |
13 |
120.92 |
10.08 |
○ |
○ |
○ |
||||
| 2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT | |||||||||||||
You can remove those steps that had no failures:
agent_3 %>% get_agent_report(arrange_by = "severity", keep = "fail_states")
| Pointblank Validation | |||||||||||||
| Workshop agent No. 3
tibble
small_tableWARN
0.15
STOP
0.25
NOTIFY
0.35
|
|||||||||||||
| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | col_vals_regex()
|
|
✓ |
13 |
60.46 |
70.54 |
● |
● |
● |
||||
| 9 | col_vals_in_set()
|
|
✓ |
13 |
70.54 |
60.46 |
● |
● |
● |
||||
| 2 | col_vals_lt()
|
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||||
| 4 | col_vals_between()
|
|
✓ |
13 |
120.92 |
10.08 |
● |
○ |
○ |
||||
| 10 | rows_distinct()
|
— | — |
|
✓ |
13 |
110.85 |
20.15 |
● |
○ |
○ |
||
| 2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT | |||||||||||||
You can change the language of the report:
agent_3 %>% get_agent_report(lang = "de")
| Pointblank-Validierung | |||||||||||||
| Workshop agent No. 3
tibble
small_tableWARN
0,15
STOP
0,25
NOTIFY
0,35
|
|||||||||||||
| SCHRITT | SPALTEN | WERTE | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_is_posix()
|
— |
|
✓ |
1 |
11,00 |
00,00 |
○ |
○ |
○ |
— | ||
| 2 | col_vals_lt()
|
|
✓ |
13 |
110,85 |
20,15 |
● |
○ |
○ |
||||
| 3 | col_vals_regex()
|
|
✓ |
13 |
60,46 |
70,54 |
● |
● |
● |
||||
| 4 | col_vals_between()
|
|
✓ |
13 |
120,92 |
10,08 |
● |
○ |
○ |
||||
| 5 | col_is_logical()
|
— |
|
✓ |
1 |
11,00 |
00,00 |
○ |
○ |
○ |
— | ||
| 6 | col_is_character()
|
— |
|
✓ |
1 |
11,00 |
00,00 |
○ |
○ |
○ |
— | ||
| 7 | col_is_character()
|
— |
|
✓ |
1 |
11,00 |
00,00 |
○ |
○ |
○ |
— | ||
| 8 | col_vals_lt()
|
|
✓ |
13 |
120,92 |
10,08 |
○ |
○ |
○ |
||||
| 9 | col_vals_in_set()
|
|
✓ |
13 |
70,54 |
60,46 |
● |
● |
● |
||||
| 10 | rows_distinct()
|
— | — |
|
✓ |
13 |
110,85 |
20,15 |
● |
○ |
○ |
||
| 2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT | |||||||||||||
action_levels(); there can be default DQ thresholds and
step-specific thresholds (in both cases, supplied to
actions).get_data_extracts()).get_sundered_data())get_agent_x_list() function (useful for programming with
the validation results).email_create(); this integrates with
the blastula R package.get_agent_report().